28 - Deep Learning - Architectures Part 3 [ID:16211]

Welcome back to Deep Learning and today we want to discuss a little more architectures

and in particular the really deep ones.

So here we are really going towards deep learning.

Whereas a human might need just dozens of examples, these models will need millions.

If you try to train deeper models with all the techniques that we've seen so far, you see that we run into a certain kind of saturation.

If you want to go deeper, then you just add layers on top and you would hope that the

training error would go down.

But if you look very carefully, you can see that the 20-layer network has a lower training error and also a lower test error than, for example, a 56-layer model.

So we cannot just keep adding more and more layers and hope that things get better, because it doesn't work.

And this effect is not just caused by overfitting. We are only building layers on top, so there must be other reasons, and these are likely related to the vanishing gradient problem.

Maybe one reason could be the ReLU, or the initialization, or the problem of internal covariate shift, for which we tried batch normalization, ELUs, and SELUs. But we still have a problem with the poor propagation of activations and gradients.

And we see that if we try to build those very deep models, we get problems with vanishing gradients and we can't train the early layers, which even leads to worse results on the training set.

So I have one solution for you, and that is residual units. Residual units are a very cool idea.

So what they propose is not to learn the direct mapping F(x), but instead to learn the residual mapping. So we want to learn H(x), and H(x) is the difference between F(x) and x. We could also express it the other way around, and this is actually how it's implemented: you compute your network output F(x) as some layer H(x) plus x, i.e. F(x) = H(x) + x.

So the trainable part now essentially sits in a side branch, and that side branch is H(x). On the main branch we just pass x through, and adding the side branch to it delivers our estimate y.
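
As a minimal sketch of this idea, assuming PyTorch and using two 3x3 convolutions as the trainable side branch H(x) (the exact layers inside H are my assumption, not taken from the slide):

```python
import torch
import torch.nn as nn


class ResidualBlock(nn.Module):
    """Computes y = x + H(x): identity main branch plus a trainable side branch."""

    def __init__(self, channels):
        super().__init__()
        # H(x): the trainable side branch (here: two 3x3 convolutions with a ReLU).
        self.h = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
        )

    def forward(self, x):
        # The main branch carries x unchanged; the side branch adds the residual H(x).
        return x + self.h(x)


# Because y has the same shape as x, such blocks can be stacked freely.
x = torch.randn(1, 64, 32, 32)
y = ResidualBlock(64)(x)
```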

In the original implementation of residual blocks, there was still a difference; it was not exactly like we had it on the previous slide. In the side branch we had a weight layer, batch normalization, a ReLU, another weight layer, another batch normalization, then the addition, and then another non-linearity, a ReLU, for one residual block.

This was later changed to using batch norm, ReLU, weight, batch norm, ReLU, weight for the residual block, and it turned out that this configuration was more stable: on the addition branch it is essentially the identity that is backpropagated.
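
As a sketch of the two orderings just described, assuming PyTorch with convolutions as the weight layers (the class names are illustrative, not from any library):

```python
import torch.nn as nn


class OriginalResidualBlock(nn.Module):
    """Original ordering: weight, BN, ReLU, weight, BN, then add x and a final ReLU."""

    def __init__(self, channels):
        super().__init__()
        self.branch = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.BatchNorm2d(channels),
            nn.ReLU(),
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.BatchNorm2d(channels),
        )
        self.relu = nn.ReLU()

    def forward(self, x):
        return self.relu(x + self.branch(x))


class PreActResidualBlock(nn.Module):
    """Later ordering: BN, ReLU, weight, BN, ReLU, weight; the addition stays a pure identity."""

    def __init__(self, channels):
        super().__init__()
        self.branch = nn.Sequential(
            nn.BatchNorm2d(channels),
            nn.ReLU(),
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.BatchNorm2d(channels),
            nn.ReLU(),
            nn.Conv2d(channels, channels, 3, padding=1),
        )

    def forward(self, x):
        return x + self.branch(x)
```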

This is very nice, because we can then propagate the gradient back into the early layers just through this addition, and we get much more stable backpropagation this way.
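
To make the statement about the identity branch concrete, here is the chain rule for one residual block written out (the loss L and the Jacobian notation are my additions, not from the slides):

```latex
% With y = x + H(x), the gradient of a loss L with respect to the block input x is
\[
  \frac{\partial L}{\partial x}
  = \frac{\partial L}{\partial y}\left( I + \frac{\partial H(x)}{\partial x} \right)
  = \frac{\partial L}{\partial y}
  + \frac{\partial L}{\partial y}\,\frac{\partial H(x)}{\partial x}.
\]
% The identity term passes the gradient to earlier layers unchanged,
% even if the Jacobian of H(x) becomes very small.
```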

This then brings us to a complete residual network.

So we cut it at the bottom and show the bottom part on the right-hand side. This is a comparison between VGG, the 34-layer plain network, and the 34-layer residual network, and you can see that there are essentially these skip connections being introduced that allow us to skip over a step or backpropagate into the respective layers.

There's also downsampling involved, and then of course the skip connections also have to be downsampled. This is why we have dotted connections at these positions, and we can see that VGG
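
For those dotted skip connections, here is a small sketch of one common way to match the shapes when the main path downsamples, assuming PyTorch; the strided 1x1 projection is one of the options described in the ResNet paper, and the class name is just illustrative:

```python
import torch
import torch.nn as nn


class DownsamplingResidualBlock(nn.Module):
    """Residual block whose main path halves the resolution and changes the channel count."""

    def __init__(self, in_channels, out_channels, stride=2):
        super().__init__()
        self.branch = nn.Sequential(
            nn.Conv2d(in_channels, out_channels, 3, stride=stride, padding=1),
            nn.BatchNorm2d(out_channels),
            nn.ReLU(),
            nn.Conv2d(out_channels, out_channels, 3, padding=1),
            nn.BatchNorm2d(out_channels),
        )
        # The skip connection must match the new spatial size and channel count,
        # so it is downsampled as well, here with a strided 1x1 convolution.
        self.skip = nn.Conv2d(in_channels, out_channels, 1, stride=stride)
        self.relu = nn.ReLU()

    def forward(self, x):
        return self.relu(self.skip(x) + self.branch(x))


# Example: 64 -> 128 channels, spatial resolution halved.
x = torch.randn(1, 64, 32, 32)
y = DownsamplingResidualBlock(64, 128)(x)  # shape (1, 128, 16, 16)
```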

Part of a video series
Accessible via: open access
Duration: 00:13:23 min
Recording date: 2020-05-19
Uploaded on: 2020-05-20 01:46:14
Language: en-US

Deep Learning - Architectures Part 3

This video discusses the ideas of residual connections in deep networks that allow going from 20 to more than 1000 layers.

Video References:
Lex Fridman's Channel
Matrix Scene

Further Reading:
A gentle Introduction to Deep Learning

Tags

backpropagation, architectures, convolution, evaluation, artificial intelligence, deep learning, machine learning, pattern recognition, feedforward networks, gradient descent, ResNet